Abstract

Motivated by the success of data-driven convolutional neural networks (CNNs) in object recognition on static images, researchers are working hard towards developing CNN equivalents for learning video features. However, learning video features globally has proven to be quite a challenge due to the high dimensionality of video, the lack of labeled data and the difficulty of processing large-scale video data. Therefore, we propose to leverage effective techniques from both data-driven and data-independent approaches to improve action recognition systems.

Our contribution is three-fold. First, we propose a two-stream Stacked Convolutional Independent Subspace Analysis (ConvISA) architecture to show that unsupervised learning methods can significantly boost the performance of traditional local features extracted from data-independent models. Second, we demonstrate that by learning on video volumes detected by Improved Dense Trajectory (IDT), we can seamlessly combine our novel local descriptors with hand-crafted descriptors, and can therefore utilize available feature enhancing techniques developed for hand-crafted descriptors. Finally, similar to the multi-class classification framework in CNNs, we propose a training-free re-ranking technique that exploits the relationship among action classes to improve the overall performance. Our experimental results on four benchmark action recognition datasets show significantly improved performance.


1. Introduction 

Despite a long history of prior work, action recognition in videos, especially unconstrained videos that have large visual and motion variation, remains a very challenging task. Recent progress on this problem mainly relies on improvements of features, which fall into two broad categories: 1) more traditional hand-crafted local features [38, 35] and their corresponding bag-of-feature (BoF) encoding methods [25], and 2) learning based features that are mainly inspired by the success of convolutional neural networks (CNNs) for image recognition [6, 31, 15] and recurrent neural networks (RNNs) for speech recognition [7, 8, 22]. In this paper we combine the merits of both classes of methodologies.

Figure 1: Illustration of our novel local video descriptors. LOP and LOF describe gray pixel and optical flow volumes, respectively. They resemble HOG/HOF/MBH in a data-driven learning framework.

Trajectory based features, especially Improved Dense Trajectories (IDT) [37], are state-of-the-art hand-crafted features that have dominated action recognition on videos in recent years. Compared with other hand-crafted motion features, IDT performs better in that it models long-term motion information and has a motion boundary descriptor (MBH) that is robust to camera motion. This long-term motion information modeling, as shown in [15, 31], is very hard to learn in a CNN framework. Despite its superiority, IDT still relies on simple hand-crafted local descriptors such as Histogram of Gradient (HOG) and Histogram of Optical Flow (HOF) [20] that took years of effort to develop. In contrast, for image and speech recognition [16, 22], data-driven approaches have demonstrated their superiority and gradually replaced the traditional hand-crafted methods.

These revolutionary changes are largely enabled by the availability of neural network algorithms, large-scale labeled data and powerful parallel machines. Learning video features for action recognition, however, has proven to be quite a challenge due to its intrinsically high dimensionality, the lack of training data and the difficulty of processing large-scale video data [15, 31, 22]. With limited training data and computational power, the learned features are generally not discriminative enough and perform worse than IDT, especially on datasets that have few training instances. Recent approaches [31, 22] circumvent these problems by learning on sampled frames or very short video clips and/or using weakly labeled data. However, video-level label information can be incomplete or even missing at the frame/clip level, which leads to the problem of false label assignment; this can be even worse for weakly labeled data [15]. In other words, the imprecise frame/clip-level labels propagated from video labels are too noisy for learning powerful models. With better labeled data, neural network algorithms would give superior results. Unfortunately, accurately labeling video data is very expensive.

Though we see the value in developing fully automatically learned global video features using limited training data, we propose in this paper to revisit the traditional local feature pipeline and combine the merits of both data-independent and data-driven approaches. Inspired by the two-stream ConvNet [31], we introduce a two-stream Stacked Convolutional Independent Subspace Analysis (ConvISA) architecture to learn both visual appearance and motion information in an unsupervised way. As shown in Figure 1, instead of learning on frames or short video clips, we learn on much smaller primitives: video volumes that follow the trajectories detected by IDT. The learned descriptors, called LOP (Learned descriptors of Pixel) and LOF (Learned descriptors of optical Flow), aim to improve the best performing hand-crafted descriptors within an unsupervised data-driven learning framework. The proposed architecture has several attractive properties:

• Compared to full video learning, small video volumes lie in a much lower dimensional space and hence are computationally efficient to learn.

• By doing unsupervised learning, we avoid the costly work of collecting labeled data and the false label assignment problem in current supervised video learning settings.

• By learning on video volumes defined by IDT, the resulting descriptors can seamlessly combine with and boost the performance of hand-crafted descriptors.

• By following the traditional local feature pipeline, we can easily utilize techniques developed for traditional local descriptors to improve our data-driven descriptors.

Another merit of CNN approaches is that they naturally capture the relationships among action classes, which are hard to capture with the traditional ‘one-versus-all’ support vector machine (SVM) framework. To address this problem, we design a Multi-class Iterative Re-ranking (MIR) method. MIR requires no training yet can significantly improve the ranking performance measured by mean average precision (MAP). To make this technique useful for improving classification accuracy, we also propose a simple ranking-score fusion technique.

Our experimental results on four benchmark action recognition datasets show performance that is very competitive with the state of the art. Specifically, our methods achieve or exceed the state-of-the-art on the HMDB51, Hollywood2, UCF101 and UCF50 datasets.

In the remainder of this paper, we provide more background information about video features with an emphasis on recent attempts at learning with deep neural networks. We then describe IDT and stacked convolutional ISA in detail. After that, we evaluate our methods. Further discussions, including potential improvements, are given at the end.

2. Related Work 

In conventional video representations, features and encoding methods are the two chief reasons for considerable progress in the field. Among them, the trajectory based approaches [21, 34, 35, 37, 14], especially the Dense Trajectory (DT) method proposed by Wang et al. [35, 37], together with the Fisher Vector encoding [26], are the basis of the current state-of-the-art algorithms. Peng et al. [24] further improved the performance of IDT by increasing the codebook sizes and fusing multiple coding methods. Sapienza et al. [30] explored ways to sub-sample and generate vocabularies for Dense Trajectory features. Hoai & Zisserman [10] achieved the state-of-the-art on several action recognition datasets by using three techniques: data augmentation, learning classifiers to model the score distribution over video subsequences, and learning classifiers to capture the relationship among action classes. Lan et al. [18] proposed a multi-skip feature stacking method that combines features at multiple frame rates to achieve temporal invariance for action features. They also proposed to attach normalized 3D location information to the local descriptors and achieved good results on several action benchmark datasets. Fernando et al. [5] proposed to model the evolution of appearance within the video and achieved state-of-the-art results on the Hollywood2 dataset.

Independent Component Analysis (ICA) [12] was the first approach to learn representations of videos in an unsupervised way. Le et al. [19] approached this problem using stacked ConvISA, but they model only appearance information and do not use optical flow, so their features struggle to capture motion information [15, 31]. Generative models for understanding transformations between pairs of consecutive images are also well studied. Recently, Ranzato et al. [27] proposed a generative model for videos. The model uses a recurrent neural network to predict the next frame or interpolate between frames.

More recently, some success has been reported using deep CNNs for action recognition in videos. Karpathy et al. [15] trained a deep CNN using 1 million weakly labeled YouTube videos and reported moderate success in using it as a feature extractor. Simonyan & Zisserman [31] reported a result that is competitive with IDT [37] by training deep CNNs using both sampled frames and stacked optical flows. Srivastava et al. [33] tried unsupervised feature learning using long short-term memory (LSTM) but obtained worse results than supervised feature learning methods.

As for modeling the relationship among classes, Bergamo & Torresani [1] proposed a Meta-class method to identify related image classes based on misclassification errors on a validation set. Hou et al. [11] proposed a method to identify similar class pairs and group them together to train ‘two versus rest’ classifiers. By combining ‘two versus rest’ with ‘one versus rest’ classifiers, they observed significant improvements over baselines. Hoai & Zisserman [10] proposed to learn the correlation and exclusion between action classes by learning to combine the base classifier’s prediction and sorted predictions from other classifiers. Unlike the aforementioned approaches, which require learning and modify the predictions only once, our method is training free and iteratively updates the prediction scores given the improved prediction scores from the previous iterations.

3. Improved Dense Trajectory 

IDT improves the DT feature [35] by explicitly estimating camera motion and removing trajectories generated by camera motion. In this section, we first describe how trajectories are generated and then the original hand-crafted descriptors.

3.1. Dense trajectories 

As illustrated in Figure 1, trajectories are generated by tracking feature points across multiple frames. Gray pixel and optical flow volumes are extracted along the trajectories and represented by different kinds of descriptors. Following [35], our trajectories are extracted at multiple spatial scales, with a scale step of $1/\sqrt{2}$ and 8 scales. Feature points are densely sampled with a step size of 5 pixels and tracked in each scale separately. Specifically, each feature point $f_t = (x_t, y_t)$ at frame $t$ is tracked to frame $t+1$ by median filtering in a dense optical flow field. Mathematically,

$$f_{t+1} = (x_{t+1}, y_{t+1}) = (x_t, y_t) + (M * \mathbf{u}_t)|_{(\bar{x}_t, \bar{y}_t)} \qquad (1)$$

where $M$ is the median filtering kernel, $(\bar{x}_t, \bar{y}_t)$ is the rounded position of $(x_t, y_t)$ and $\mathbf{u}_t = (u_t, v_t)$ is the dense optical flow field calculated by the Farneback algorithm [4]. As can be seen, this tracking algorithm is computationally efficient given the optical flow field. The tracked points are put together to define a trajectory $(f_t, f_{t+1}, f_{t+2}, \dots, f_{t+l-1})$, where $l$ is the length of the trajectory. Wang et al. [35] show experimentally that $l = 15$ gives good results across several benchmark datasets and that longer trajectories may suffer from the drifting problem. Static trajectories are non-informative and hence removed.
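The tracking rule of Eq. (1) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the 3x3 median window, the edge handling and the function names are assumptions made here, not details given in the text, and the flow field is taken as a precomputed input rather than produced by the Farneback algorithm.

```python
import numpy as np

def median_filter_3x3(channel):
    """Median-filter a 2D array with a 3x3 window (edges replicated).

    The 3x3 window size is an assumption; the text does not state the size of M.
    """
    padded = np.pad(channel, 1, mode="edge")
    h, w = channel.shape
    # Stack the nine shifted copies and take the per-pixel median.
    shifts = [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    return np.median(np.stack(shifts, axis=0), axis=0)

def track_point(point, flow):
    """Advance one point by Eq. (1): f_{t+1} = f_t + (M * u_t) at the rounded position.

    point: (x, y) float coordinates; flow: array of shape (H, W, 2) holding (u, v).
    """
    h, w, _ = flow.shape
    smoothed = np.stack([median_filter_3x3(flow[..., c]) for c in range(2)], axis=-1)
    x, y = point
    # Rounded position, clipped so that drifting points still index the field.
    xi = int(np.clip(round(x), 0, w - 1))
    yi = int(np.clip(round(y), 0, h - 1))
    du, dv = smoothed[yi, xi]
    return (x + du, y + dv)

def track_trajectory(start, flows):
    """Track a point through a list of flow fields; returns len(flows) + 1 points."""
    points = [start]
    for flow in flows:
        points.append(track_point(points[-1], flow))
    return points
```

With $l = 15$, a trajectory consists of 15 points and therefore needs 14 consecutive flow fields.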

3.2. Hand-crafted descriptors for IDT 

There are four hand-crafted descriptors for IDT. Among them, trajectory shape, which encodes the relative locations of the trajectories, can be obtained directly from $f_t$. The other three descriptors are Histogram of Gradient (HOG), Histogram of Optical Flow (HOF) and MBH. IDT computes HOG/HOF along the dense trajectories. For both HOG and HOF, full orientations are quantized into 8 bins, with an additional zero bin for HOF. Both descriptors are normalized with their $\ell_2$-norm. The MBH descriptor separates the optical flow field into its x and y components. Spatial derivatives are computed for each of them and orientation information is quantized into histograms, similar to the HOG descriptor. For each component, there is an 8-bin histogram normalized by its $\ell_2$-norm. Since MBH represents the gradient of the optical flow, constant motion information is suppressed and only information about changes in the flow field is kept, which is a simple and effective way to remove camera motion. These histogram-based descriptors are computed within space-time volumes aligned with a trajectory to encode the appearance and motion information. The size of the volume is $s \times s$ pixels and $l$ frames long, which corresponds to the input size of stacked ConvISA. To embed structure information, the volume is subdivided into a spatio-temporal grid of size $s_T \times s_T \times l_v$. In this paper, we fix the size of the volume and grid as in IDT [35], i.e., $s = 32$, $l = 15$, $s_T = 2$ and $l_v = 3$.
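As a rough illustration of the histogram descriptors described above, the following sketch quantizes 2D vectors (image gradients for HOG, flow vectors for HOF) into 8 orientation bins, optionally adds a zero bin, and applies $\ell_2$ normalization. Magnitude weighting, the zero-bin threshold `zero_thresh` and counting near-zero vectors in the zero bin are assumptions made for this sketch; the text does not specify them.

```python
import numpy as np

def orientation_histogram(dx, dy, n_bins=8, zero_bin=False, zero_thresh=1e-3):
    """Magnitude-weighted orientation histogram, L2-normalized.

    dx, dy: arrays of the same shape (gradient or flow components).
    zero_bin: if True, append one bin counting near-zero vectors (as for HOF).
    zero_thresh: hypothetical magnitude threshold for the zero bin.
    """
    dx = np.asarray(dx, dtype=float).ravel()
    dy = np.asarray(dy, dtype=float).ravel()
    mag = np.hypot(dx, dy)
    angle = np.mod(np.arctan2(dy, dx), 2 * np.pi)      # full orientation in [0, 2*pi)
    bins = np.minimum((angle / (2 * np.pi) * n_bins).astype(int), n_bins - 1)

    hist = np.zeros(n_bins + (1 if zero_bin else 0))
    moving = mag > zero_thresh if zero_bin else np.ones_like(mag, dtype=bool)
    np.add.at(hist, bins[moving], mag[moving])          # accumulate magnitudes per bin
    if zero_bin:
        hist[n_bins] = np.count_nonzero(~moving)        # count of near-zero vectors
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

def descriptor_length(s_T=2, l_v=3, n_bins=8, zero_bin=False):
    """Length of a descriptor built from one histogram per grid cell."""
    return s_T * s_T * l_v * (n_bins + (1 if zero_bin else 0))
```

Under the stated grid of $2 \times 2 \times 3$ cells, one 8-bin histogram per cell gives a 96-dimensional descriptor, and the extra zero bin raises this to 108.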

4. Learning Descriptors for IDT 

Although IDT is the current state-of-the-art action recognition feature, it still relies on hand-crafted descriptors. In this section, we describe how we use Stacked ConvISA to learn descriptors for IDT. By learning descriptors for IDT, we also aim to generalize the best performing hand-crafted features within a data-driven learning framework.


4.1. Stacked ConvISA 


ISA is an unsupervised learning algorithm that learns features from unlabeled image patches. An ISA network [13] can be described as a two-layered network. More precisely, let the matrices $W \in \mathbb{R}^{m \times n}$ and $V \in \mathbb{R}^{d \times m}$ denote the parameters of the first and second layers of ISA, respectively, where $n$ is the dimension of the input, $d$ is the dimension of the ISA output and $m$ is the number of latent variables between the first and second layers. Typically $d < m < n$. The matrix $W$ is learned from data under the orthogonality constraint $WW^T = I$; we therefore call $W$ the projection matrix. The matrix $V$ is given by the network structure and groups the output variables of the first layer: $V_{ij} = 1$ if the $j$-th output variable of the first layer is in the $i$-th group, and $V_{ij} = 0$ otherwise. We therefore call $V$ the grouping matrix. Given an input pattern $X^t \in \mathbb{R}^n$, the activation of the $i$-th output unit of the second layer is $p_i(X^t; W, V)$, defined by

$$p_i(X^t; W, V) = \sqrt{\sum_{k=1}^{m} V_{ik} \Big( \sum_{j=1}^{n} W_{kj} X^t_j \Big)^2}. \qquad (2)$$

ISA enforces the activations of the output units to be sparse. To achieve sparse activations, it minimizes the following loss function defined on $T$ training instances:

$$\min_{W} \sum_{t=1}^{T} \sum_{i=1}^{d} p_i(X^t; W, V) \quad \text{s.t.} \quad WW^T = I. \qquad (3)$$
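Eqs. (2) and (3) translate almost directly into matrix operations. The sketch below evaluates the activations and the loss for given $W$ and $V$; it does not perform the constrained optimization, and the contiguous, equally sized groups built by `grouping_matrix` are an illustrative choice rather than something prescribed by the text.

```python
import numpy as np

def isa_activations(X, W, V):
    """Second-layer ISA responses p_i(x; W, V) of Eq. (2) for each column of X.

    X: (n, T) inputs, W: (m, n) projection matrix, V: (d, m) 0/1 grouping matrix.
    Returns an array of shape (d, T).
    """
    first_layer = W @ X                  # (m, T): linear responses W x
    return np.sqrt(V @ first_layer**2)   # pool squared responses within each group

def isa_loss(X, W, V):
    """Objective of Eq. (3): sum of all activations over groups and instances."""
    return isa_activations(X, W, V).sum()

def grouping_matrix(m, group_size):
    """Build V for contiguous, equally sized groups (an illustrative choice)."""
    d = m // group_size
    V = np.zeros((d, m))
    for i in range(d):
        V[i, i * group_size:(i + 1) * group_size] = 1.0
    return V
```

For example, with $W$ equal to the first four rows of a $6 \times 6$ identity matrix, groups of size 2 and input $(3, 4, 0, 5, 9, 9)^T$, the two activations are $\sqrt{3^2 + 4^2} = 5$ and $\sqrt{0^2 + 5^2} = 5$, so the loss is 10.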


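Another way to interpret ISA is through the sparse coding framework. Let $\mathcal{G} = [G_1, G_2, \dots, G_d]$ denote the variable group indexes defined by $V$, that is, $j \in G_i$ if and only if $V_{ij} = 1$. $|G_i|$ defines the group size, which is generally the same across groups.

As in group LASSO [6], for any vector $\alpha \in \mathbb{R}^m$, we define the group $\ell_1$-norm $\|\alpha\|_{\mathcal{G},1}$ as

$$\|\alpha\|_{\mathcal{G},1} = \sum_{i=1}^{d} \sqrt{\sum_{j \in G_i} \alpha_j^2}.$$

Summed over the $d$ output units, the activations $p_i(X^t; W, V)$ can then be written as

$$\sum_{i=1}^{d} p_i(X^t; W, V) = \|WX^t\|_{\mathcal{G},1}.$$

Denote $\alpha_t = WX^t$. Since $WW^T = I$, we have

$$X^t = W^\dagger \alpha_t,$$

where $W^\dagger$ is the Moore-Penrose pseudo inverse of $W$. Eq. (3) can be re-formulated as the sparse coding problem

$$\min \sum_{t=1}^{T} \|\alpha_t\|_{\mathcal{G},1} \quad \text{s.t.} \quad (W^\dagger)^T W^\dagger = I, \quad X^t = W^\dagger \alpha_t. \qquad (4)$$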


Based on Eq. (4), ISA is essentially searching for a group-sparse representation $\alpha_t$ of the input signal $X^t$. The matrix $W^\dagger$ is the dictionary of the sparse coding. The orthogonality constraint on $W^\dagger$ makes the learned components maximally independent.

On high dimensional data, ISA is computationally expensive to run. In ISA, we usually set $m = O(n)$, so the optimization defined in Eq. (3) has space complexity $O(n^2)$ and time complexity $O(n^3)$. To handle this problem, Le et al. [19] proposed a stacking architecture that progressively makes use of PCA and ISA as sub-units for unsupervised learning. The key ideas of this approach are as follows. An ISA network is first trained on small input patches. This learned network is then convolved with a larger region of the input image with a fixed stride (step size). The combined responses of the convolution step are then given as input to the next layer, which is another ISA network with PCA as a preprocessing step. As in the first layer, PCA is used to whiten the data and reduce their dimensions so that the next ISA layer only works with low dimensional inputs. The stacked model is trained greedily layer-wise, in the same manner as other algorithms proposed in the deep learning literature [9]. Applying the models above to the video domain is rather straightforward: the inputs to the network are 3D video blocks instead of image patches. More specifically, a sequence of image patches is flattened into a vector, which becomes the input to the network above. The detailed algorithm of stacked ConvISA can be found in Algorithm 1. The parameters are set as in [19].


4.2. Two stream stacked ConvISA 

The initial stacked ConvISA was designed to learn appearance and motion together, but as shown in [15], spatio-temporal features learned from fixed size cuboids do not capture the motion part well [31]. Inspired by the two-stream ConvNet [31], we propose a two-stream stacked ConvISA that learns appearance features and motion features separately. The appearance stream learns on gray scale video volumes that follow the trajectories detected by IDT, and the motion stream learns to describe the corresponding optical flow volumes. In this paper, we use a two-layer stacked ConvISA as in [19]. The inputs of the stacked ConvISA are $32 \times 32 \times 15$ video volumes, within which we subsample $16 \times 16 \times 5$ video volumes to train the first layer of stacked ConvISA. Figure 2 shows some randomly chosen filters learned from the two-stream stacked ConvISA. As can be seen, filters from optical flow have much more complex shapes than filters from gray pixels, which indicates that motion signals are more difficult to learn.





Algorithm 1 Stacked ConvISA

1: Input: Sample $T$ trajectory volumes $X = [X^1, X^2, \dots, X^t, \dots, X^T]$ from videos, where $X^t \in \mathbb{R}^{32 \times 32 \times 15}$

2: -First layer network-

3: Uniformly sample $T$ sub-volumes $\hat{X} = [\hat{X}^1, \hat{X}^2, \dots, \hat{X}^t, \dots, \hat{X}^T]$ from $X$, where $\hat{X}^t \in \mathbb{R}^{16 \times 16 \times 5}$

4: Train PCA on $\hat{X}$ with PCA dimension of 300:
   pca_model_1 = PCA_train($\{\hat{X}^1, \dots, \hat{X}^T\}$, 300)
   $\hat{X}_{pca1}$ = PCA_apply(pca_model_1, $\hat{X}$)

5: Train ISA on $\hat{X}_{pca1}$ with ISA group size 1:
   isa_model_1 = ISA_train($\hat{X}_{pca1}$, 1)

6: Convolve $X$ with the first layer’s model:
   $X_{pca1}$ = PCA_apply(pca_model_1, $X$)
   $X_{isa1}$ = ISA_apply(isa_model_1, $X_{pca1}$)

7: Use $X_{isa1}$ as the output of the first layer network:
   $X_{layer1} = [X_{isa1}]$

8: -Second layer network-

9: Train PCA on $X_{layer1}$ with PCA dimension of 200:
   pca_model_2 = PCA_train($X_{layer1}$, 200)
   $X_{pca2}$ = PCA_apply(pca_model_2, $X_{layer1}$)

10: Train ISA on $X_{pca2}$ with group size 2:
   isa_model_2 = ISA_train($X_{pca2}$, 2)
   $X_{isa2}$ = ISA_apply(isa_model_2, $X_{pca2}$)

11: Stack the top 100 dimensions of $X_{pca2}$ with $X_{isa2}$ as the output $X_{layer2}$ of the second layer network:
   $X_{layer2} = [X_{pca2}(1:100); X_{isa2}]$

12: Output: $X_{layer2}$
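The data flow of Algorithm 1 can be traced with the toy NumPy sketch below. Several parts are assumptions made only so that the example runs end to end: ISA is trained with a simple projected-gradient loop (not necessarily the optimizer used in [19]), one random sub-volume is sampled per trajectory volume, the default stride equals the sub-volume size, and the helper names mirror the pseudocode rather than any existing library.

```python
import numpy as np

def pca_train(X, dim):
    """Fit PCA whitening on the columns of X (features x samples)."""
    mean = X.mean(axis=1, keepdims=True)
    U, s, _ = np.linalg.svd(X - mean, full_matrices=False)
    scale = np.sqrt(X.shape[1] - 1) / np.maximum(s[:dim], 1e-8)
    return mean, U[:, :dim].T * scale[:, None]      # whitening matrix, (dim, features)

def pca_apply(model, X):
    mean, P = model
    return P @ (X - mean)

def isa_train(X, group_size, steps=50, lr=0.01, seed=0):
    """Toy projected-gradient ISA: minimize Eq. (3) subject to W W^T = I."""
    m, T = X.shape
    assert m % group_size == 0, "input dimension must be divisible by group size"
    V = np.kron(np.eye(m // group_size), np.ones((1, group_size)))  # grouping matrix
    W = np.linalg.qr(np.random.default_rng(seed).normal(size=(m, m)))[0]
    for _ in range(steps):
        Z = W @ X
        p = np.sqrt(V @ Z**2) + 1e-8                 # group activations, Eq. (2)
        grad = ((V.T @ (1.0 / p)) * Z) @ X.T / T     # gradient of the mean loss
        U, _, Vt = np.linalg.svd(W - lr * grad)
        W = U @ Vt                                   # project back onto W W^T = I
    return W, V

def isa_apply(model, X):
    W, V = model
    return np.sqrt(V @ (W @ X)**2)

def subvolumes(vol, size, stride):
    """Flattened sub-volumes of one trajectory volume, one per column."""
    (H, Wd, L), (h, w, l), (sy, sx, st) = vol.shape, size, stride
    cols = [vol[y:y + h, x:x + w, t:t + l].ravel()
            for y in range(0, H - h + 1, sy)
            for x in range(0, Wd - w + 1, sx)
            for t in range(0, L - l + 1, st)]
    return np.stack(cols, axis=1)

def stacked_convisa(volumes, size=(16, 16, 5), stride=(16, 16, 5),
                    d1=300, d2=200, keep=100, seed=0):
    """Follow the steps of Algorithm 1 on volumes of shape (T, H, W, L)."""
    rng = np.random.default_rng(seed)
    per_volume = [subvolumes(v, size, stride) for v in volumes]
    # Steps 3-5: one random sub-volume per volume trains the first layer.
    sampled = np.stack([s[:, rng.integers(s.shape[1])] for s in per_volume], axis=1)
    pca1 = pca_train(sampled, d1)
    isa1 = isa_train(pca_apply(pca1, sampled), group_size=1)
    # Steps 6-7: apply layer 1 to every sub-volume and concatenate the responses.
    layer1 = np.stack([isa_apply(isa1, pca_apply(pca1, s)).ravel(order="F")
                       for s in per_volume], axis=1)
    # Steps 9-11: second layer, then stack leading PCA dimensions with ISA output.
    pca2 = pca_train(layer1, d2)
    x_pca2 = pca_apply(pca2, layer1)
    isa2 = isa_train(x_pca2, group_size=2)
    return np.vstack([x_pca2[:keep], isa_apply(isa2, x_pca2)])
```

With the dimensions of Algorithm 1 (200 PCA dimensions, group size 2, top 100 PCA dimensions kept), each volume would be described by a 200-dimensional vector: 100 PCA components stacked on 100 ISA responses. The default sizes need at least 300 training volumes; smaller settings such as `stacked_convisa(vols, size=(4, 4, 3), stride=(4, 4, 3), d1=10, d2=8, keep=4)` are convenient for a quick check on random data.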


5. Multi-class Iterative Re-ranking (MIR) 

Algorithm 2 shows our MIR algorithm. We are given a score matrix $P \in \mathbb{R}^{N \times K}$ that contains $K$ classifiers’ predictions on $N$ instances. We update each score $P_{i,j}$ iteratively by looking at the other classifiers’ predictions on the same instance and reducing the score using those predictions. The reduction is carried out by first sorting the other classifiers’ predictions $\{P_{i,1}, P_{i,2}, \dots, P_{i,K}\} \setminus P_{i,j}$ in descending order and then subtracting the weighted sum of the sorted scores from $P_{i,j}$. We use exponentially decaying weights, and the weighting coefficient $\alpha$ is set to 1 throughout the paper. MIR assumes that most instances belong to only one class. We acknowledge that this assumption is quite strong for most real-world applications. However, our experiments show that even with this assumption, MIR is quite useful for improving ranking performance on real-world data where each instance can have multiple labels. We found that Algorithm 2 improves only the ranking performance, not the classification accuracy. To make the ranking improvement benefit classification accuracy, we propose to fuse the improved ranking from MIR with the original scores. Specifically, after obtaining the improved ranking matrix $P_R$ from $P^{(W)}$, we normalize $P_R$ by scaling it to between 0 and 1 and applying a z-score normalization to obtain $\hat{P}_R$, which is comparable to the original score matrix $P$. Our final prediction is $\hat{P} = \hat{P}_R + P$.

Figure 2: Example filters learned using two-stream stacked ConvISA. Panel (b): filters learned from optical flow.

Algorithm 2 Multi-class Iterative Re-ranking

1: Input: The prediction scores of $K$ classes $P \in \mathbb{R}^{N \times K}$; re-ranking annealing parameter $\eta$; re-ranking weighting coefficient $\alpha > 0$; total iteration steps $W$.

2: Init: $P^{(0)} = P$

3: for $w = 1, 2, \dots, W-1$ do

4: For any instance index $i \in \{1, 2, \dots, N\}$ and class index $j \in \{1, 2, \dots, K\}$,
   $A^{(w)}_{i,j} = \mathrm{sort}(\{P^{(w)}_{i,1}, P^{(w)}_{i,2}, \dots, P^{(w)}_{i,K}\} \setminus P^{(w)}_{i,j}, \downarrow)$
   $P^{(w+1)}_{i,j} = P^{(w)}_{i,j} - \eta^{w-1} \sum_{r=1}^{K-1} e^{-\alpha r} A^{(w)}_{i,j}(r)$

5: end for

6: Output: $P^{(W)}$
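The re-ranking update can be sketched as follows. This is one reading of the algorithm, not a reference implementation: iterations are counted from zero so that the first update uses an annealing weight of 1, the default `eta=0.5` and `iterations=4` are placeholder values (the text does not report $\eta$), and the score-ranking fusion step is left out because its details are not fully specified.

```python
import numpy as np

def mir(P, eta=0.5, alpha=1.0, iterations=4):
    """Multi-class iterative re-ranking on an (N, K) score matrix.

    At step w (counted from 0), every score P[i, j] is reduced by eta**w times
    the exponentially weighted sum of the other classes' scores on instance i,
    sorted in descending order. eta=0.5 is a hypothetical default.
    """
    P = np.array(P, dtype=float)
    n, k = P.shape
    weights = np.exp(-alpha * np.arange(1, k))          # e^{-alpha r}, r = 1..K-1
    for w in range(iterations):
        new_P = np.empty_like(P)
        for j in range(k):
            others = np.delete(P, j, axis=1)            # (N, K-1): drop class j
            sorted_desc = -np.sort(-others, axis=1)     # descending per instance
            new_P[:, j] = P[:, j] - eta**w * (sorted_desc @ weights)
        P = new_P                                       # all scores updated together
    return P
```

Because the largest competing score receives the largest weight, a class whose competitors score highly on the same instance is penalized more than a class that clearly dominates, which is how the update encodes the assumption that most instances belong to one class.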

6. Experiments 

We examine the proposed LOP and LOF descriptors on several activity recognition datasets, predominantly involving actions. The experimental results show that the new descriptors, together with several existing feature enhancing techniques and MIR, can improve IDT representations significantly.

6.1. Experimental setting 

As in [37] and [19], IDT features are extracted using 15 frame tracking, camera motion stabilization and RootSIFT normalization, and are described by Trajectory, HOG, HOF, MBH, LOP and LOF descriptors. Stacked ISA models are trained on 200,000 video volumes. PCA is used to reduce the dimensionality of descriptors by a factor of two. For Fisher Vector encoding, we map the raw descriptors into a Gaussian Mixture Model with 256 Gaussians trained from a set of 256,000 randomly sampled data points. Power and $\ell_2$ normalization are also used before concatenating different types of descriptors into a video based representation. For classification, we use a linear SVM classifier with a fixed $C = 100$ as recommended by [37], and the one-versus-all approach is used for the multi-class classification scenario. We use two feature enhancing techniques developed by Lan et al. [18] to show that improvements developed for hand-crafted descriptors can also be used for our learned descriptors. These two techniques are 3D location extension (xyt-extension) and Multi-skip Feature Stacking (MIFS). MIR is used after xyt-extension and MIFS are applied.
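The power and $\ell_2$ normalization step applied to each encoded descriptor can be illustrated as below. The exponent of 0.5 (a signed square root) is a common choice for Fisher Vectors and is an assumption here; the text does not state the value used.

```python
import numpy as np

def power_l2_normalize(v, power=0.5):
    """Signed power normalization followed by L2 normalization.

    power=0.5 is an assumed default, not a value given in the text.
    """
    v = np.asarray(v, dtype=float)
    v = np.sign(v) * np.abs(v) ** power     # dampen large components, keep signs
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def video_representation(encoded_descriptors):
    """Normalize each descriptor type separately, then concatenate."""
    return np.concatenate([power_l2_normalize(e) for e in encoded_descriptors])
```

For instance, `power_l2_normalize([4, -9])` first maps the vector to `[2, -3]` and then divides by $\sqrt{13}$.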

6.2. Datasets 

Four representative datasets are used. The HMDB51 dataset [17] has 51 action classes and 6766 video clips extracted from digitized movies and YouTube. [17] provides both original videos and stabilized ones; we only use the original videos in this paper. The Hollywood2 dataset [20] contains 12 action classes and 1707 video clips that are collected from 69 different Hollywood movies. We use the standard splits with training and testing videos provided by [20]. The UCF101 dataset [32] has 101 action classes spanning 13320 YouTube video clips. We use the standard splits with training and testing videos provided by [32]. The UCF50 dataset [29] has 50 action classes spanning 6618 YouTube video clips that can be split into 25 groups. The video clips in the same group are generally very similar in background. Leave-one-group-out cross-validation as recommended by [29] is used. We report mean accuracy (MAcc) for HMDB51, UCF101 and UCF50 and mean average precision (MAP) for Hollywood2, as in the original papers.

6.3. Evaluation of our learned IDT descriptors 

In this section we compare hand-crafted descriptors with our learned descriptors as well as their combination. To compute the descriptors, we follow the parameter settings in [35] and [19]. That is, $s = 32$, $s_T = 2$ and $l_v = 3$ for all descriptors. We fix the trajectory length to $l = 15$. We use $d_1 = 300$ as the dimension of the first layer ISA outputs and $d_2 = 200$ as the dimension of the second layer ISA outputs. In section 6.7, we will show that these parameters experimentally give good performance across datasets.

Descriptor        HMDB51   Hw2    UCF101   UCF50
Traj              31.9     42.7   55.2     69.3
HOG               42.0     47.4   72.4     77.5
LOP               47.2     54.3   79.3     83.2
HOF               49.8     55.0   74.6     86.1
MBH               52.4     60.8   81.4     87.0
LOF               51.0     55.4   81.2     86.8
Hand crafted      59.1     64.5   85.5     90.2
Traj + Learned    56.1     61.1   85.9     89.3
Hybrid            62.3     65.5   87.5     92.0

Table 1: Comparison of our novel descriptors with hand-crafted descriptors. We report MAcc over all classes for the HMDB51, UCF101 and UCF50 datasets, and MAP for Hollywood2 (Hw2). Traj means the trajectory shape descriptor; Hand crafted contains the four hand-crafted descriptors Traj, HOG, HOF and MBH; Learned indicates LOP + LOF; and Hybrid contains all six descriptors.

Results on the four datasets are presented in Table 1. Overall, our learned descriptors boost the performance of IDT by 1% to 3%. Comparing LOP to HOG, we can see that LOP is much better than HOG at capturing appearance information, resulting in an improvement of roughly 5 to 7 percentage points. Our LOF descriptor is better than HOF, but is still worse than MBH. These results show that even when learning on video volumes defined by trajectories, motion information is hard to capture and represent. When comparing the combination of hand-crafted features with the combination of learned features, our learned features are generally worse than the hand-crafted features except on UCF101, but when all descriptors are combined, our learned descriptors boost the performance significantly. These improvements demonstrate that unsupervised learning methods have the potential to significantly boost the performance of hand-crafted features.

6.4. Feature enhancing techniques 

We apply two feature enhancing techniques by Lan et al. [18] to improve the performance of our hybrid descriptors, which include both hand-crafted and learned descriptors. The first technique is 3D location extension (xyt-extension), which augments the descriptors with normalized 3D location information. The second is Multi-skip Feature Stacking (MIFS), which stacks raw features extracted from videos at different frame rates before encoding. Table 2 summarizes the results. Overall, both xyt-extension and MIFS help to improve the performance of our hybrid descriptors on all four datasets. These improvements show that by learning on local video volumes defined by IDT, we can easily utilize available techniques developed for hand-crafted features, which would be difficult or impossible for global feature learning.

Technique        HMDB51   Hw2    UCF101   UCF50
xyt-extension    63.9     67.3   88.2     94.1
MIFS             64.0     67.5   88.5     94.5
Combined         66.5     68.8   89.7     95.1

Table 2: Comparison of different feature enhancing techniques for our hybrid descriptors. We report MAcc over all classes for HMDB51, UCF101 and UCF50, and MAP for Hollywood2 (Hw2).

Figure 3: Effect of stride on performance. Panels: (a) HMDB51, (b) Hollywood2; horizontal axis: stride.

Iteration   HMDB51   Hw2    UCF101   UCF50
0           66.8     68.8   91.2     96.5
1           69.3     71.1   93.4     97.6
2           69.9     71.6   94.0     98.0
3           70.2     71.9   94.2     98.1
4           70.3     71.9   94.4     98.1
5           70.3     71.9   94.4     98.1

Table 3: MIR iteration number versus MAP performance on the four datasets we tested. Iteration 0 means the original ranking without MIR. For Hollywood2 (Hw2) and UCF50, MIR converges at the 3rd iteration, and for HMDB51 and UCF101, MIR converges at the 4th iteration.

6.5. Multi-class iterative re-ranking (MIR) 

In this section, we report results of our MIR technique applied to the combined results in section 6.4. Table 3 shows the number of iterations versus MAP performance on the four datasets we tested. As can be seen, on all four datasets, MIR converges within 4 iterations to results that are significantly better than the baselines, which are shown as iteration 0. These improvements show that although MIR is based on strong assumptions, it is still quite useful for real-world data. The improvements also hold on the Hollywood2 dataset, whose source is movie data where multiple labels exist for some instances. For HMDB51, UCF101 and UCF50, after fusing the scores with the ranking, we get slight improvements in MAcc, from 66.5, 89.7 and 95.1 to 67.0, 90.2 and 95.4, respectively.

6.6. Comparison to the state-of-the-art 

In Table 4, we compare our best results with state-of-the-art
approaches. On most of the datasets, we observe improvements
over the state of the art. Note that although we list several
of the most recent approaches here for comparison, most of
them are not directly comparable to our results because they
use different data augmentation, features and representation
methods. For example, Hoai & Zisserman [10] and Fernando et
al. [5] use left-right mirrored video data augmentation to
gain about 2% on the Hollywood2 dataset; this augmentation
could easily be adopted by our algorithms, and their methods
could in turn use the MIFS and xyt extension techniques used
in this paper. The most comparable approach is Lan et al.
[18], with which we enhance our approaches. Karpathy et al.
[15] trained ConvNets on 1 million weakly labeled YouTube
videos and reported 65.4% MAcc on the UCF101 dataset.
Simonyan & Zisserman [31] reported results that are
competitive with IDT by training deep convolutional neural
networks on both sampled frames and optical flow, obtaining
59.4% MAcc on HMDB51 and 88.0% MAcc on UCF101, which are
comparable to the results of Wang & Schmid [37]. Peng et al.
[25] achieve results similar to ours on the HMDB51 dataset by
combining a hierarchical Fisher Vector with the original one.

Figure 4: Size of receptive field versus performance.

Figure 5: Dimension of descriptors versus performance.
Panels: (a) HMDB51, (b) Hollywood2. Horizontal axis:
dimension of descriptors.
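
As a rough check of the comparison, the short Python sketch below takes the scores as transcribed from Table 4 and computes, for each dataset, the margin between our result and the best listed prior result. The names `TABLE4`, `OURS` and `margin_over_best_prior` are illustrative only and are not part of the paper's pipeline.

```python
# Scores transcribed from Table 4 (MAcc % for HMDB51, UCF101 and UCF50;
# MAP % for Hollywood2). Only prior methods are listed here.
TABLE4 = {
    "HMDB51": {
        "Wang et al. [37]": 57.2, "Simonyan et al. [31]": 59.4,
        "Peng et al. [24]": 61.1, "Lan et al. [18]": 65.0,
        "Peng et al. [25]": 66.8,
    },
    "Hollywood2": {
        "Oneata et al. [23]": 63.3, "Wang et al. [37]": 64.3,
        "Lan et al. [18]": 68.0, "Hoai et al. [10]": 73.6,
        "Fernando et al. [5]": 73.7,
    },
    "UCF101": {
        "Karpathy et al. [15]": 65.4, "Wang et al. [36]": 85.9,
        "Peng et al. [24]": 87.9, "Simonyan et al. [31]": 88.0,
        "Lan et al. [18]": 89.1,
    },
    "UCF50": {
        "Arridhana et al. [2]": 90.0, "Oneata et al. [23]": 90.0,
        "Wang et al. [37]": 91.2, "Peng et al. [24]": 92.3,
        "Lan et al. [18]": 94.4,
    },
}

# The "Ours" row of Table 4.
OURS = {"HMDB51": 67.0, "Hollywood2": 71.9, "UCF101": 90.2, "UCF50": 95.4}


def margin_over_best_prior(dataset):
    """Return (best prior method, our score minus its score) for a dataset."""
    best_method, best_score = max(TABLE4[dataset].items(), key=lambda kv: kv[1])
    # Round to one decimal place, the precision reported in the table.
    return best_method, round(OURS[dataset] - best_score, 1)


if __name__ == "__main__":
    for name in TABLE4:
        method, margin = margin_over_best_prior(name)
        print(f"{name}: {margin:+.1f} relative to {method}")
```

Under this reading of the table, the margin is positive on three of the four datasets. Hollywood2 is the exception: the two higher-scoring entries there are the methods that use mirrored data augmentation.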

6.7. Evaluation of parameters 

We now characterize the performance of our new descriptors
along several parameter axes. We report results on HMDB51
and Hollywood2, as they are the two most challenging datasets
according to performance. We study the effect of stride,
receptive field size (size of the first-layer input) and
number of features (output size), as in [3]. We evaluate one
parameter at a time, with the other parameters fixed at their
default values.

HMDB51 (MAcc. %)           | Hollywood2 (MAP %)         | UCF101 (MAcc. %)           | UCF50 (MAcc. %)
Wang et al. [37]      57.2 | Oneata et al. [23]    63.3 | Karpathy et al. [15]  65.4 | Arridhana et al. [2]  90.0
Simonyan et al. [31]  59.4 | Wang et al. [37]      64.3 | Wang et al. [36]      85.9 | Oneata et al. [23]    90.0
Peng et al. [24]      61.1 | Lan et al. [18]       68.0 | Peng et al. [24]      87.9 | Wang et al. [37]      91.2
Lan et al. [18]       65.0 | Hoai et al. [10]      73.6 | Simonyan et al. [31]  88.0 | Peng et al. [24]      92.3
Peng et al. [25]      66.8 | Fernando et al. [5]   73.7 | Lan et al. [18]       89.1 | Lan et al. [18]       94.4
Ours                  67.0 | Ours                  71.9 | Ours                  90.2 | Ours                  95.4

Table 4: Comparison of our results to the state-of-the-art.

Effect of stride. First, we evaluate the effect of stride. In
this experiment, we test spatial strides of 4, 8 and 16
pixels and temporal strides of 2 and 5. The results are
summarized in Figure 3. Surprisingly, and in contrast to [3],
we find that larger strides consistently give better
performance. Since performance improved steadily as we
increased the spatial stride from 4 to 16 and the temporal
stride from 2 to 5, we attempted to increase the stride even
further. However, the stride is constrained by the size of
the input volume. The results in Figure 4 confirm that larger
strides give better results because they cause less overlap
in convolution and hence pass less redundant information to
the next layer of the network. This is notable because
non-overlapping scanning saves large amounts of training and
prediction time.
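
To make the link between stride and overlap concrete, the following minimal Python sketch computes, along one axis, how many window positions a convolution visits and what fraction of each window is shared with its neighbor. The spatial field of 16 and axis length of 32 used in the demonstration are illustrative assumptions, not values confirmed by this section, and the function names are hypothetical.

```python
def num_positions(length, field, stride):
    """Number of window positions along one axis (no padding)."""
    if field > length or stride < 1:
        raise ValueError("need 1 <= stride and field <= length")
    return (length - field) // stride + 1


def overlap_fraction(field, stride):
    """Fraction of a window shared with the next window along one axis."""
    return max(0, field - stride) / field


if __name__ == "__main__":
    length, field = 32, 16  # assumed axis length and receptive field
    for stride in (4, 8, 16):  # the spatial strides tested in the text
        print(stride, num_positions(length, field, stride),
              overlap_fraction(field, stride))
```

With these assumed sizes, moving from stride 4 to stride 16 reduces the positions per axis from 5 to 2 and the overlap between neighboring windows from 75% to none, which illustrates both the reduced redundancy and the saving in computation.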

Effect of receptive field. We also study the effect of
receptive field size, i.e., the input size of the first layer
of the network. We test spatial sizes of 8, 16 and 24 and
temporal sizes of 5 and 10, and use the maximum stride that
can cover the input volumes. For example, for a receptive
field of 8 x 8 x 10, the stride would be (8, 5). As can be
seen in Figure 4, the receptive field size does not affect
performance as much as the stride does. Overall, the (16, 5)
pair works best, and larger receptive fields do not give
better performance. Thus, we suggest using (16, 5), as it
offers a good balance between computational cost and
performance.
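
One possible reading of "the maximum stride that can cover the input volumes" is the largest stride, no larger than the receptive field, for which the windows reach exactly to the end of the input. The Python sketch below implements that reading. The input volume of 32 x 32 pixels by 15 frames is an assumption made only for illustration; under it, the sketch reproduces the (8, 5) stride quoted for the 8 x 8 x 10 receptive field. The function name is hypothetical.

```python
def max_covering_stride(length, field):
    """Largest stride <= field whose windows end exactly at `length`."""
    if field > length:
        raise ValueError("receptive field larger than input")
    gap = length - field
    if gap == 0:
        # A single window already covers the whole axis.
        return field
    for stride in range(field, 0, -1):
        if gap % stride == 0:
            return stride


if __name__ == "__main__":
    spatial_len, temporal_len = 32, 15  # assumed input volume size
    for s_field in (8, 16, 24):
        for t_field in (5, 10):
            stride = (max_covering_stride(spatial_len, s_field),
                      max_covering_stride(temporal_len, t_field))
            print((s_field, t_field), "->", stride)
```

The loop always terminates with a value because a stride of 1 covers any input. If the actual input volumes differ from the assumed size, the resulting strides will differ as well.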

Dimension of features. Finally, we study feature dimensions
ranging from 50 to 300 with a step size of 50. The dimension
of the features balances the information that is lost against
the noise that is removed. As can be seen in Figure 5, a
dimension of 200 is a good trade-off given the size of our
inputs. This result is consistent with what has been
suggested by [9].


Train/Apply    HMDB51    Hollywood2
HMDB51         56.1      59.3
Hollywood2     56.0      61.1


Table 5: Generalizability of Stacked ConvISA models. 


6.8. Model generalizability 

CNN models have been shown to generalize well across
different datasets [28]. It is also interesting to examine
the generalizability of our models learned without
supervision. To this end, we apply the model learned from the
Hollywood2 dataset to the HMDB51 dataset and vice versa. The
results are shown in Table 5. If we train on Hollywood2 and
apply the model to HMDB51, there is almost no loss in
performance, whereas if we train on HMDB51 and apply the
model to Hollywood2, the performance drops by about 2%. These
results indicate that a model trained on a difficult dataset
such as Hollywood2 can generalize well.
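
The drops quoted above follow directly from Table 5. The short Python sketch below recomputes them, treating rows as the training dataset and columns as the dataset the model is applied to, which is the reading consistent with the text. The names `TABLE5` and `transfer_drop` are illustrative only.

```python
# Table 5 entries keyed by (training dataset, dataset the model is applied to).
TABLE5 = {
    ("HMDB51", "HMDB51"): 56.1,
    ("HMDB51", "Hollywood2"): 59.3,
    ("Hollywood2", "HMDB51"): 56.0,
    ("Hollywood2", "Hollywood2"): 61.1,
}


def transfer_drop(train, target):
    """Score lost on `target` when the model is trained on `train` instead."""
    in_domain = TABLE5[(target, target)]
    cross = TABLE5[(train, target)]
    # Round to one decimal place, the precision reported in the table.
    return round(in_domain - cross, 1)


if __name__ == "__main__":
    print("Hollywood2 -> HMDB51:", transfer_drop("Hollywood2", "HMDB51"))
    print("HMDB51 -> Hollywood2:", transfer_drop("HMDB51", "Hollywood2"))
```

The first transfer loses 0.1 points and the second 1.8 points, matching the "almost no loss" and "about 2%" statements.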


7. Conclusions 

Unlike the current trend of learning video features with deep
neural networks, which is computationally intensive and label
demanding, we propose in this paper to revisit the
traditional local feature pipeline and combine the merits of
both data-independent and data-driven approaches. As an
example, we present two novel local video descriptors for
IDT, a state-of-the-art local video feature. The proposed
descriptors are learned using Stacked ConvISA on gray pixel
and optical flow volumes that follow the trajectories
detected by IDT. We also design a multi-class iterative
re-ranking technique called MIR to exploit relationships
among classes. Extensive experiments on four real-world
datasets show that our new descriptors, when combined with
hand-crafted descriptors and MIR, match or exceed
state-of-the-art methods. Future work includes determining
the appropriate depth for Stacked ConvISA. Additionally, we
would like to study other unsupervised learning methods for
video feature learning.